Risk bounds for time series without strong mixing 
Abstract


We show how to control the generalization error of time series models wherein past 
values of the outcome are used to predict future values. The results are based on a 
generalization of standard IID concentration inequalities to dependent data. We show 
how these concentration inequalities behave under different versions of dependence to 
provide some intuition for our methods.
1 Introduction

Much of the literature in machine learning focuses on studying the behavior of predictions 
constructed based on a training set $(X_1, Y_1), \ldots, (X_n, Y_n)$ where one wishes to construct
a mapping from X to Y . This training set may consist of n IID draws from a common 
distribution, or it may have some dependence property such as ergodicity or mixing behavior 
[si 0, 0]. It may even be generated by an adversary intent on deceiving us about the
relationship [1, 10]. 

Time series data are different. We observe only a single sequence of random variables
$Y_1^n = (Y_1, \ldots, Y_n)$ taking values in a measurable space $\mathcal{Y}$ and wish to learn a function
which takes the past observations as inputs and predicts the future. Suppose, given data
from time 1 to time $n$, we wish to predict time $n + h$ for some $h \in \mathbb{N}$. Then for some loss
function $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}^+$, and some predictor $g : \mathcal{Y}^n \to \mathcal{Y}$, we define the prediction risk, or
generalization error, as

$$R(g) := E[\ell(Y_{n+h}, g(Y_1^n))]. \qquad (1)$$

Here we assume that the data series is stationary, a notion defined more precisely
in Section 2.1. This assumption gives us some hope of controlling the generalization error
defined in (1); absent this sort of behavior, the past and future could be unrelated.

Since the true distribution is unknown, so is R(g), but we can attempt to estimate it 
based on only our observed data. In situations with predictors $X$ and responses $Y$, there is
the obvious estimator

$$\hat{R}_n(g) := \frac{1}{n} \sum_{i=1}^{n} \ell(Y_i, g(X_i)).$$

However, in this case, we may use some or all of the past to generate predictions, and
it may be that we have not observed $Y_{i+h}$ for some $i$. To ease notation for the
remainder of the paper, assume that we have observed some sequence of data $Y_1, \ldots, Y_{n+j}$
for $j \in \mathbb{N}$ such that it is possible to evaluate the quantity $\ell(Y_{i+h}, g(Y_1, \ldots, Y_i))$ for each
$i \in \{1, \ldots, n\}$. For time series prediction, we define the training error as

$$\hat{R}_n(g) := \frac{1}{n} \sum_{i=1}^{n} \ell(Y_{i+h}, g(Y_1^i)). \qquad (2)$$

Here $g$ is some function chosen out of a class of possible functions $\mathcal{G}$.
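The training error in (2) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `training_error` and `last_value`, the squared-error default loss, and the use of a plain list for the series are all hypothetical choices, not part of the paper.

```python
def training_error(y, g, h=1, loss=lambda target, pred: (target - pred) ** 2):
    """Average loss of predictor g over a series y, as in equation (2).

    y    : list of observations y_1, ..., y_{n+h}
    g    : function mapping the observed past (y_1, ..., y_i) to a prediction
    h    : forecast horizon
    loss : loss(target, prediction); squared error is an assumed default
    """
    n = len(y) - h  # number of (past, future) pairs that can be evaluated
    if n < 1:
        raise ValueError("need at least h + 1 observations")
    total = 0.0
    for i in range(1, n + 1):
        past = y[:i]           # y_1, ..., y_i
        target = y[i + h - 1]  # y_{i+h} in 1-based indexing
        total += loss(target, g(past))
    return total / n


def last_value(past):
    """A hypothetical 'persistence' predictor: forecast the most recent value."""
    return past[-1]


series = [1.0, 2.0, 4.0, 7.0]
err = training_error(series, last_value, h=1)
```

Note that each prediction uses the whole observed past up to time $i$, which is what distinguishes this quantity from the usual regression training error.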

Choosing a particular prediction function $\hat{g}$ as the minimizer of $\hat{R}_n$ over $\mathcal{G}$ is "empirical
risk minimization" (ERM); this often gives poor results because the choice of $\hat{g}$ adapts to
the training data, causing the training error to be an over-optimistic estimate of the true
risk. Additionally, training error can only shrink as model complexity grows, so ERM will
tend to overfit the data and give poor out-of-sample predictions.

While $\hat{R}_n(g)$ converges to $R(g)$ for many algorithms, one can show that when $\hat{g}$ minimizes
(2), $E[\hat{R}_n(\hat{g})] \le R(\hat{g})$. There are a number of ways to mitigate this issue. The first is to
restrict the class $\mathcal{G}$. The second is to change the optimization problem, penalizing model
complexity. Rather than attempting to estimate $R(g)$, we provide bounds on it which hold
with high probability across all possible prediction functions $g \in \mathcal{G}$. A typical result in this
literature is a confidence bound on the risk which says that with probability at least $1 - \delta$,

$$R(g) \le \hat{R}_n(g) + \Gamma(C(\mathcal{G}), n, \delta),$$

where $C(\cdot)$ measures the complexity of the model class $\mathcal{G}$, and $\Gamma(\cdot)$ is a function of the
complexity, the confidence level, and the number of observed data points.

In Section 2, we provide some background material necessary to characterize our results,
including some concentration inequalities for dependent data. Section 3 derives risk bounds
for time series and gives a novel proof that the standard Rademacher complexity
characterizes the flexibility of $\mathcal{G}$. Section 4 supplies some straightforward examples showing how
dependence affects the quality of bounds. Section 5 concludes and provides some ideas
about the future of these results.

2 Time series, complexity, and concentration of measure 

In this section, we introduce some of the math necessary to develop our results: stationarity
is a prerequisite for control of generalization error; Rademacher complexity measures the
flexibility of the model space $\mathcal{G}$; dependence modifies concentration inequalities.

Throughout what follows, $Y = \{Y_t\}_{t=-\infty}^{\infty}$ will be a sequence of random variables, i.e.,
each $Y_t$ is a measurable mapping from some probability space $(\Omega, \mathcal{F}, P)$ into a measurable
space $\mathcal{Y}$. A block of the random sequence will be written $Y_i^j = \{Y_t\}_{t=i}^{j}$, where either limit
may go to infinity. The $\sigma$-field generated by a particular block $Y_i^j$ will be given by $\mathcal{F}_i^j$.
2.1 Time series



The dependent data setting we investigate is based on stationary time series input data. 
We first remind the reader of the notion of (strict or strong) stationarity. 

Definition 2.1 (Stationarity). A sequence of random variables $Y$ is stationary when all
its finite-dimensional distributions are invariant over time: for all $t$ and all non-negative
integers $i$ and $j$, the random vectors $Y_t^{t+i}$ and $Y_{t+j}^{t+i+j}$ have the same distribution.

Stationarity does not imply that the random variables $Y_t$ are independent across time
$t$, only that the distribution of $Y_t$ is constant over time.



2.2 Rademacher complexity 



Statistical learning theory provides several ways of measuring the complexity of a class of 
predictive models. The results we use rely on Rademacher complexity (see, e.g., [1]), which
measures how well the model can (seem to) fit white noise. 

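Definition 2.2 (Rademacher Complexity). Let $Y_1^n$ be a time series drawn according to a
joint distribution $\nu$. The empirical Rademacher complexity is

$$\hat{\mathfrak{R}}_n(\mathcal{G}) := 2 E_\sigma\left[ \sup_{g \in \mathcal{G}} \left| \frac{1}{n} \sum_{i=1}^{n} \sigma_i g(Y_1^i) \right| \,\Big|\, Y_1^n \right],$$

where $\sigma_i$ are a sequence of random variables, independent of each other and everything else,
and equal to $+1$ or $-1$ with equal probability. The Rademacher complexity is

$$\mathfrak{R}_n(\mathcal{G}) := E_\nu\left[ \hat{\mathfrak{R}}_n(\mathcal{G}) \right],$$

where the expectation is over sample paths $Y_1^n$ generated by $\nu$.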

The term inside the supremum, $\left| \frac{1}{n} \sum_{i=1}^{n} \sigma_i g(Y_1^i) \right|$, is the absolute sample covariance between the
noise $\sigma$ and the predictions of a particular model $g$. The Rademacher complexity takes the
largest value of this sample covariance over all models in the class (mimicking empirical risk
minimization), then averages over realizations of the noise.

Intuitively, Rademacher complexity measures how well our models could seem to fit 
outcomes which were really just noise, giving a baseline against which to assess the risk 
of over-fitting or failing to generalize. As the sample size $n$ grows, for any given $g$ the
sample covariance $\left| \frac{1}{n} \sum_{i=1}^{n} \sigma_i g(Y_1^i) \right| \to 0$ by the ergodic theorem; the overall Rademacher
complexity should also shrink, though more slowly, unless the model class is so flexible that 
it can fit absolutely anything, in which case one can conclude nothing about how well it 
will predict in the future from the fact that it performed well in the past. 
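For a finite class of predictors, the empirical Rademacher complexity can be approximated by Monte Carlo over the noise draws. The sketch below is an assumption-laden illustration, not the paper's procedure: the function name `empirical_rademacher`, the finite list of predictors, and the number of draws are all hypothetical choices.

```python
import random


def empirical_rademacher(y, predictors, num_draws=2000, seed=0):
    """Monte Carlo estimate of 2 * E_sigma[ max_g |(1/n) sum_i sigma_i g(y_1..y_i)| ].

    y          : list of observations y_1, ..., y_n (held fixed)
    predictors : finite list of functions g mapping a past (y_1..y_i) to a number
    num_draws  : number of simulated Rademacher sequences (assumed, tune as needed)
    """
    rng = random.Random(seed)
    n = len(y)
    # Precompute g(y_1..y_i) for every predictor and every i.
    preds = [[g(y[:i]) for i in range(1, n + 1)] for g in predictors]
    total = 0.0
    for _ in range(num_draws):
        sigma = [rng.choice((-1, 1)) for _ in range(n)]
        # Largest absolute correlation with the noise over the whole class.
        total += max(abs(sum(s * p for s, p in zip(sigma, row))) / n for row in preds)
    return 2.0 * total / num_draws


def last(past):
    return past[-1]


def const_one(past):
    return 1.0


y = [0.5, -1.0, 2.0, 0.1]
small_class = empirical_rademacher(y, [last])
large_class = empirical_rademacher(y, [last, const_one])
```

With the same seed, enlarging the class can only increase the estimate, which mirrors the idea that more flexible classes fit noise better.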



2.3 Concentration inequalities 

For IID data, the main tools for developing risk bounds are the inequalities of Hoeffding 0] 
and McDiarmid [sj. Instead, we will use dependent versions of each which generalize the 
IID results. These inequalities are derived in van de Geer [12]. They rely on constructing
predictable bounds for random variables based on past behavior, rather than assuming a 
priori knowledge of the distribution. 






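Theorem 2.3 (van de Geer [12] Theorem 2.5). Consider a random sequence $Y_1^\infty$ where

$$L_i \le Y_i \le U_i \quad \text{a.s. for all } i \ge 1,$$

where $L_i < U_i$ are $\mathcal{F}_1^{i-1}$-measurable random variables, $i \ge 1$. Define

$$C_n^2 := \sum_{i=1}^{n} (U_i - L_i)^2,$$

with the convention $C_0^2 = 0$. Then for all $\epsilon > 0$, $c > 0$,

$$P\left( \sum_{i=1}^{n} Y_i \ge \epsilon \text{ and } C_n^2 \le c^2 \text{ for some } n \right) \le \exp\left\{ -\frac{2\epsilon^2}{c^2} \right\}.$$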

Of course, if $L_i$ and $U_i$ are non-random, this returns the usual Hoeffding inequality. Here,
however, they must only be forecastable given past values of the random sequence.
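The reduction to Hoeffding's inequality can be checked numerically. The sketch below assumes that the quantity $C_n^2$ is the sum of squared ranges $(U_i - L_i)^2$, which is how the examples in Section 4 use it, and that the tail bound is $\exp\{-2\epsilon^2/c^2\}$. The function names are hypothetical.

```python
import math


def c_squared(lower, upper):
    """Sum of squared ranges (U_i - L_i)^2 over the sequence of bounds."""
    return sum((u - l) ** 2 for l, u in zip(lower, upper))


def tail_bound(eps, c2):
    """exp(-2 eps^2 / c^2): the assumed form of the dependent-data tail bound."""
    return math.exp(-2.0 * eps ** 2 / c2)


def hoeffding_sum_bound(eps, n, a, b):
    """Classical Hoeffding bound for a centered sum of n variables in [a, b]."""
    return math.exp(-2.0 * eps ** 2 / (n * (b - a) ** 2))


# Non-random bounds L_i = a, U_i = b for every i.
n, a, b, eps = 100, 0.0, 1.0, 10.0
c2 = c_squared([a] * n, [b] * n)
dependent = tail_bound(eps, c2)
classical = hoeffding_sum_bound(eps, n, a, b)
```

When the bounds are constants, `c2` equals $n(b-a)^2$ and the two expressions coincide; the interesting cases are those where the ranges shrink or grow with the observed past.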



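Theorem 2.4 (van de Geer [12] Theorem 2.6). Fix $n \ge 1$. Let $Z_n$ be $\mathcal{F}_1^n$-measurable such
that

$$L_i \le E[Z_n \mid \mathcal{F}_1^i] \le U_i \quad \text{a.s.},$$

where $L_i < U_i$ are $\mathcal{F}_1^{i-1}$-measurable. Define $C_n^2$ as above. Then for all $\epsilon > 0$, $c > 0$,

$$P\left( Z_n - E[Z_n] \ge \epsilon \text{ and } C_n^2 \le c^2 \right) \le \exp\left\{ -\frac{2\epsilon^2}{c^2} \right\}.$$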

To see how this generalizes McDiarmid's inequality, we provide the following corollary. 
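Corollary 2.5. Let $g(Y_1, \ldots, Y_n)$ be some real valued function on $\mathcal{Y}^n$ such that

$$\left| E[g(Y_1, \ldots, Y_n) \mid \mathcal{F}_1^i] - E[g(Y_1, \ldots, Y_n) \mid \mathcal{F}_1^{i-1}] \right| \le k_i \qquad (3)$$

where $k_i$ is $\mathcal{F}_1^{i-1}$-measurable. Then,

$$P\left( g(Y_1, \ldots, Y_n) - E[g(Y_1, \ldots, Y_n)] \ge \epsilon \text{ and } \sum_{i=1}^{n} k_i^2 \le c^2 \right) \le \exp\left\{ -\frac{2\epsilon^2}{c^2} \right\}.$$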



In particular, this gives a couple of immediate consequences. Suppose that $g$ is bounded.
Then, we have that

$$k_i \le \sup_{y_1, \ldots, y_n} \ \sup_{y_i', \ldots, y_n'} \left| g(y_1, \ldots, y_n) - g(y_1, \ldots, y_{i-1}, y_i', \ldots, y_n') \right|.$$

This contrasts with the bounded differences inequality in the IID case, wherein one only
needs to be concerned with one point that is different. For IID data, starting from (3), we have

$$k_i \le \sup_{y_1, \ldots, y_n,\, y_i'} \left| g(y_1, \ldots, y_{i-1}, y_i, \ldots, y_n) - g(y_1, \ldots, y_{i-1}, y_i', \ldots, y_n) \right| = d_i,$$

if $g$ satisfies bounded differences with constants $d_i$. In other words, Theorem 2.4 conflates
dependence with nice functional behavior.
3 Risk bounds



Generalization error bounds follow from deriving high probability upper bounds on the
quantity

$$Q_n(\mathcal{H}) := \sup_{h \in \mathcal{H}} \left( R(h) - \hat{R}_n(h) \right),$$

which is the worst case difference between the true risk $R(h)$ and the empirical risk $\hat{R}_n(h)$
over all functions in the class of losses $\mathcal{H} = \{h = \ell(\cdot, g(\cdot)) : g \in \mathcal{G}\}$ defined over a particular
class of prediction functions $\mathcal{G}$. In the case of time series, $Q_n(\mathcal{H})$ is $\mathcal{F}_1^n$-measurable, so we
can get risk bounds from Theorem 2.4 if we can find suitable $L_i$ and $U_i$ sequences.

Theorem 3.1. Suppose that $Q_n(\mathcal{H})$ satisfies the forecastable boundedness condition of
Theorem 2.4. Then,

$$P\left( R(h) \le \hat{R}_n(h) + E[Q_n(\mathcal{H})] + c \sqrt{\frac{\log(1/\delta)}{2}} \ \text{ or } \ C_n^2 > c^2 \right) \ge 1 - \delta.$$

In many cases (as in the examples below), $C_n^2$ will be deterministic, in which case the
result above is greatly simplified. Essentially, the theorem says that as long as each new
$Y_i$ gives us additional control on the conditional expectation of $Q_n$, we can ensure that
with high probability, our forecasts of the future will have only small losses. The proof is
straightforward: simply set the right hand side of Theorem 2.4 to $\delta$ and use De Morgan's
law.
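The two steps of this proof can be written out explicitly. The sketch below assumes the tail bound of Theorem 2.4 has the form $\exp\{-2\epsilon^2/c^2\}$, as used in the examples of Section 4, and abbreviates $Q_n(\mathcal{H})$ as $Q_n$.

```latex
\begin{align*}
\exp\{-2\epsilon^2/c^2\} = \delta
  &\iff \epsilon = c\sqrt{\log(1/\delta)/2}, \\
P\big(Q_n - E[Q_n] \ge \epsilon \text{ and } C_n^2 \le c^2\big) \le \delta
  &\implies P\big(Q_n - E[Q_n] < \epsilon \text{ or } C_n^2 > c^2\big) \ge 1 - \delta.
\end{align*}
```

Since the difference between true risk and training error is at most $Q_n$ for every $h$ in the class, the event $Q_n - E[Q_n] < \epsilon$ implies the risk bound for all $h$ simultaneously.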

Since $E[Q_n(\mathcal{H})]$ is a complicated and unintuitive object, we upper bound it with the
Rademacher complexity. The standard symmetrization argument for the IID case does not 
work, but, for time series prediction (as opposed to the more general dependent data case 
or the online learning case), Rademacher bounds are still available. We provide this result 
now. 

Theorem 3.2. For a time series prediction problem based on a sequence $Y_1^n$,

$$E[Q_n(\mathcal{H})] \le \mathfrak{R}_n(\mathcal{H}). \qquad (4)$$

The standard way of proving this result in the IID case is through introduction of a
"ghost sample" $\tilde{Y}_1^n$ which has the same distribution as $Y_1^n$. Taking empirical expectations
over the ghost sample is then the same as taking expectations with respect to the distribution
of $Y_1^n$. Randomly exchanging $\tilde{Y}_i$ with $Y_i$ by using Rademacher variables allows for control
of $E[Q_n(\mathcal{H})]$ and leads to the factor of 2 in Definition 2.2. However, in the dependent data
setting, this is not quite so easy.

For dependent data, both the ghost sample and the introduction of Rademacher variables
arise differently. A similar situation also occurs in the more complex cases of online learning
with a (perhaps constrained) adversary choosing the data sequence. It is covered in depth
in Rakhlin et al. [10, 11]. With dependent data we need a different version of the "ghost
sample" than that used in the IID case. First, we rewrite the left side of (4):



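$$E_Y[Q_n(\mathcal{H})] = E_Y\left[ \sup_{h \in \mathcal{H}} \left( R(h) - \hat{R}_n(h) \right) \right] = E_Y\left[ \sup_{g \in \mathcal{G}} \left( E_{Y_{n+h}}[\ell(Y_{n+h}, g(Y_1^n))] - \frac{1}{n} \sum_{i=1}^{n} \ell(Y_{i+h}, g(Y_1^i)) \right) \right]. \qquad (5)$$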
Figure 1: This figure displays the tree structures for $\mathbf{Z}(\sigma)$ and $\mathbf{Z}'(\sigma)$. The path along each
tree is determined by one $\sigma$ sequence, interleaving the "past" between paths.



Here, we define $z_i = (Y_{i+h}, Y_1^i)$ so that $h(z_i) = \ell(Y_{i+h}, g(Y_1^i))$ for some $g \in \mathcal{G}$. At this point,
following [10, 11], we introduce a "tangent sequence" $Z'$ rather than the ghost sample. We
construct it recursively as follows. Let

$$\mathcal{L}(Y_1') = \mathcal{L}(Y_1) \quad \text{and} \quad \mathcal{L}(Y_i' \mid Y_1, \ldots, Y_{i-1}) = \mathcal{L}(Y_i \mid Y_1, \ldots, Y_{i-1}),$$

where $\mathcal{L}$ denotes the probability law. Then, let $Z = (z_1, \ldots, z_n)$ and $Z' = (z_1', \ldots, z_n')$.

Proof of Theorem 3.2. Starting from (5) we have


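$$E[Q_n(\mathcal{H})] = E_Z\left[ \sup_{h \in \mathcal{H}} \left( R(h) - \frac{1}{n} \sum_{i=1}^{n} h(z_i) \right) \right] = E_Z\left[ \sup_{h \in \mathcal{H}} \left( E_{Z'}\left[ \frac{1}{n} \sum_{i=1}^{n} h(z_i') \right] - \frac{1}{n} \sum_{i=1}^{n} h(z_i) \right) \right]. \qquad (6)$$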



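Here we have constructed $Z'$ as a tangent sequence to $Z$ as discussed above. Then,

$$E[Q_n(\mathcal{H})] \le E_{Z,Z'}\left[ \sup_{h \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} \left( h(z_i') - h(z_i) \right) \right] \qquad \text{(Jensen)}$$

$$= E_{z_1, z_1'} E_{z_2, z_2' \mid z_1, z_1'} \cdots E_{z_n, z_n' \mid z_1, z_1', \ldots, z_{n-1}, z_{n-1}'}\left[ \sup_{h \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} \left( h(z_i') - h(z_i) \right) \right]. \qquad (7)$$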



Now, due to dependence, Rademacher variables must be introduced carefully, as in the
adversarial case. Rademacher variables create two tree structures, one associated to the $Z$
sequence, and one associated to the $Z'$ sequence (see [10, 11] for a thorough treatment).
We write these trees as $\mathbf{Z}(\sigma)$ and $\mathbf{Z}'(\sigma)$, where $\sigma$ is a particular sequence of Rademacher
variables (e.g. $(1, -1, -1, 1, \ldots, 1)$) which creates a path along each tree. For example,
consider $\sigma = \mathbf{1}$. Then $\mathbf{Z}(\sigma) = (z_1, \ldots, z_n)$ and $\mathbf{Z}'(\sigma) = (z_1', \ldots, z_n')$, the "right" path of
both tree structures. For $\sigma = -\mathbf{1}$, $\mathbf{Z}(\sigma) = (z_1', \ldots, z_n')$ and $\mathbf{Z}'(\sigma) = (z_1, \ldots, z_n)$,
the "left" path of both tree structures. Changing $\sigma_i$ from $+1$ to $-1$ exchanges $z_i$ for $z_i'$ in
both trees and chooses the left child of $z_{i-1}$ and $z_{i-1}'$ rather than the right child. Figure 1
displays both trees. In order to talk about the probability of $z_i$ conditional on the "past"
in the tree, we need to know the path taken so far. For this, we define a selector function




1 



-1. 
Distributions over these trees then become the objects of interest.
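The way a sign sequence picks out a path in each tree can be illustrated with a short sketch. It follows the description above (a sign of $+1$ keeps the original element in the first path, a sign of $-1$ swaps in the tangent element), but the function name `tree_paths` and the list representation of paths are hypothetical and ignore the conditional distributions that the trees actually carry.

```python
def tree_paths(z, z_prime, sigma):
    """Return the paths (Z(sigma), Z'(sigma)) selected by a sign sequence.

    z, z_prime : equal-length lists holding the original and tangent elements
    sigma      : list of +1 / -1 values, one per position
    """
    if not (len(z) == len(z_prime) == len(sigma)):
        raise ValueError("z, z_prime and sigma must have the same length")
    path, path_prime = [], []
    for zi, zpi, s in zip(z, z_prime, sigma):
        if s == 1:       # keep: z_i stays in Z(sigma), z_i' stays in Z'(sigma)
            path.append(zi)
            path_prime.append(zpi)
        elif s == -1:    # swap: exchange z_i and z_i' between the two paths
            path.append(zpi)
            path_prime.append(zi)
        else:
            raise ValueError("sigma entries must be +1 or -1")
    return path, path_prime


z = ["z1", "z2", "z3"]
zp = ["z1'", "z2'", "z3'"]
mixed = tree_paths(z, zp, [1, -1, 1])
```

The all-ones and all-minus-ones sequences reproduce the "right" and "left" paths described in the text; any other sequence interleaves the two.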

In the time series case, as opposed to the online learning scenario, the dependence 
between future and past means the adversary is not free to change predictors and responses 
separately. Once a branch of the tree is chosen, the distribution of future data points is 
fixed, and depends only on the preceding sequence. Because of this, the joint distribution 
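of any path along the tree is the same as any other path, i.e. for any two paths $\sigma$, $\sigma'$,

$$\mathcal{L}(\mathbf{Z}(\sigma)) = \mathcal{L}(\mathbf{Z}(\sigma')) \quad \text{and} \quad \mathcal{L}(\mathbf{Z}'(\sigma)) = \mathcal{L}(\mathbf{Z}'(\sigma')).$$

Similarly, due to the construction of the tangent sequence, we have that $\mathcal{L}(\mathbf{Z}(\sigma)) = \mathcal{L}(\mathbf{Z}'(\sigma))$.
This equivalence between paths allows us to introduce Rademacher variables swapping $z_i$
for $z_i'$ as well as the ability to combine terms below: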



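$$(7) = E_{z_1, z_1'} E_{\sigma_1} E_{z_2, z_2' \mid \chi(\sigma_1)} E_{\sigma_2} \cdots E_{z_n, z_n' \mid \chi(\sigma_1), \ldots, \chi(\sigma_{n-1})} E_{\sigma_n}\left[ \sup_{h \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} \sigma_i \left( h(z_i') - h(z_i) \right) \right]$$

$$= E_{Z, Z', \sigma}\left[ \sup_{h \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} \sigma_i \left( h(z_i') - h(z_i) \right) \right]$$

$$\le E_{Z, \sigma}\left[ \sup_{h \in \mathcal{H}} \left| \frac{1}{n} \sum_{i=1}^{n} \sigma_i h(z_i) \right| \right] + E_{Z', \sigma}\left[ \sup_{h \in \mathcal{H}} \left| \frac{1}{n} \sum_{i=1}^{n} \sigma_i h(z_i') \right| \right]$$

$$= 2 E_{Z, \sigma}\left[ \sup_{h \in \mathcal{H}} \left| \frac{1}{n} \sum_{i=1}^{n} \sigma_i h(z_i) \right| \right] = \mathfrak{R}_n(\mathcal{H}). \qquad \square$$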



Good control of $E[Q_n(\mathcal{H})]$ through the Rademacher complexity therefore implies good
control of the generalization error. Rademacher complexity is easy to handle for wide ranges 
of learning algorithms using results in [1] and elsewhere. Support vector machines, kernel
methods, and neural networks all have known Rademacher complexities. Furthermore, Lip- 
schitz composition arguments in 0] allow us to deal only with the Rademacher complexity 
of the function class $\mathcal{G}$ rather than the induced loss class $\mathcal{H}$. For loss functions $\ell$ which are
$\phi$-Lipschitz in their second argument, $\mathfrak{R}_n(\mathcal{H}) \le 2\phi\,\mathfrak{R}_n(\mathcal{G})$.

The main issue then in the application of Theorem 3.1 is the determination of the
forecastable bounds Li and Ui from the data generating process. In the next section, we 
provide a few simple examples to aid intuition. 



4 Examples 

We consider three different examples which should aid the reader in understanding the
nature of the forecastable bounds. Here we present two extreme cases, independence and
complete dependence, as well as an intermediate case. Note that $C_n^2$ is
deterministic in all three cases, though this need not hold in general.



4.1 Independence 

For IID data, we simply recover IID concentration results. As noted in Corollary 2.5, for
IID data, bounded differences yields good control. Similarly, Theorem 2.3 gives the same
results as Hoeffding's inequality for IID data. Dependence is more interesting. 
4.2 Complete dependence

Let $Y_1^n$ be generated as follows:

$$Y_1 \sim U(a, b), \quad b > a, \qquad Y_i = Y_{i-1}, \quad i \ge 2.$$

Consider trying to predict the mean $\frac{1}{n} \sum_{i=1}^{n} Y_i$. Then, given no observations, the almost
sure upper bound is $U_1 = b$ while the lower bound is $L_1 = a$. So $(U_1 - L_1)^2 = (b - a)^2$. For
$i > 1$, conditional on $\mathcal{F}_1^{i-1}$ (and therefore $\mathcal{F}_1^i$), $U_i = L_i$. Thus, $C_n^2 = (b - a)^2$, giving the
entirely useless result:

$$P\left( \frac{1}{n} \sum_{i=1}^{n} Y_i - \frac{b + a}{2} > \epsilon \right) \le \exp\left\{ -\frac{2\epsilon^2}{(b - a)^2} \right\}.$$

The right side is independent of $n$, implying that we have essentially observed one data point
regardless of $n$.
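A small simulation makes the point concrete: under complete dependence the sample mean is just the first draw, so the frequency of a large deviation does not shrink as $n$ grows. The sketch below is illustrative only; the function names, the default interval $[0, 1]$, and the number of trials are assumptions.

```python
import random


def completely_dependent_series(n, a=0.0, b=1.0, rng=None):
    """Draw Y_1 ~ U(a, b) and copy it n times (Y_i = Y_{i-1})."""
    rng = rng or random.Random()
    y1 = rng.uniform(a, b)
    return [y1] * n


def deviation_frequency(n, eps, a=0.0, b=1.0, trials=20000, seed=0):
    """Fraction of simulated series whose mean exceeds (a + b) / 2 by more than eps."""
    rng = random.Random(seed)
    midpoint = (a + b) / 2.0
    hits = 0
    for _ in range(trials):
        y = completely_dependent_series(n, a, b, rng)
        if sum(y) / n - midpoint > eps:
            hits += 1
    return hits / trials


freq_short = deviation_frequency(n=1, eps=0.2)
freq_long = deviation_frequency(n=50, eps=0.2)
```

For the uniform distribution on $[0, 1]$ and $\epsilon = 0.2$, both frequencies should sit near $0.3$ whatever the series length, consistent with a bound that does not involve $n$.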

4.3 Partial dependence 

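Let $Y_1^n$ be generated as follows:

$$Y_0 = 0, \qquad Y_i = \theta Y_{i-1} + \eta_i, \quad i \ge 1,$$

where $\theta \in (0, 1)$ and the $\eta_i$ are IID $U(a, b)$ with $b > a$. Again, consider trying to predict the mean
$\frac{1}{n} \sum_{i=1}^{n} Y_i$. We can define $L_i$ and $U_i$ as follows:

$$L_i = \frac{1}{n} \frac{a}{1 - \theta} + \frac{1}{n} \sum_{k=1}^{i-1} Y_k + \theta Y_{i-1}, \qquad U_i = \frac{1}{n} \frac{b}{1 - \theta} + \frac{1}{n} \sum_{k=1}^{i-1} Y_k + \theta Y_{i-1}.$$

From this, we have that

$$C_n^2 = \frac{(b - a)^2}{n (1 - \theta)^2}.$$

Therefore, by Theorem 2.4,

$$P\left( \frac{1}{n} \sum_{i=1}^{n} Y_i - E\left[ \frac{1}{n} \sum_{i=1}^{n} Y_i \right] > \epsilon \right) \le \exp\left\{ -\frac{2 n \epsilon^2 (1 - \theta)^2}{(b - a)^2} \right\}.$$

For comparison, if everything were IID, Hoeffding's inequality gives

$$P\left( \frac{1}{n} \sum_{i=1}^{n} Y_i - \frac{b + a}{2} > \epsilon \right) \le \exp\left\{ -\frac{2 n \epsilon^2}{(b - a)^2} \right\}.$$

Therefore, the dependence in $Y_1^n$ reduces the effective sample size by a factor of $(1 - \theta)^2$. If $\theta = 1/2$,
the effective sample size is $n/4$: each additional data point tightens the exponent of the bound
only one quarter as much as in the IID scenario.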
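The effective-sample-size reading can be checked by evaluating both bounds. This sketch assumes the exponent of the dependent bound is the IID exponent multiplied by $(1 - \theta)^2$, as stated above; the function names are hypothetical.

```python
import math


def iid_bound(n, eps, a, b):
    """Hoeffding bound for the mean of n IID variables in [a, b]."""
    return math.exp(-2.0 * n * eps ** 2 / (b - a) ** 2)


def ar1_bound(n, eps, a, b, theta):
    """Bound for the autoregressive example: n is replaced by n * (1 - theta)^2."""
    n_effective = n * (1.0 - theta) ** 2
    return math.exp(-2.0 * n_effective * eps ** 2 / (b - a) ** 2)


# With theta = 1/2, 400 dependent points give the same bound as 100 IID points.
dependent = ar1_bound(n=400, eps=0.1, a=0.0, b=1.0, theta=0.5)
independent = iid_bound(n=100, eps=0.1, a=0.0, b=1.0)
```

As $\theta$ approaches 1 the effective sample size collapses toward zero, while $\theta = 0$ recovers the IID bound exactly.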
5 Discussion



In this paper, we have demonstrated how to control the generalization of time series predic- 
tion algorithms. These methods use some or all of the observed past to predict future values 
of the same series. In order to handle the complicated Rademacher complexity bound for 
the expectation, we have followed the approach used in the online learning case pioneered 
by Rakhlin et al. [10, 11], but we show that in our particular case, much of the structure
needed to deal with the adversary is unnecessary. This results in clean risk bounds which 
have a form similar to the IID case. 

The main issue with risk bounds for dependent data is that they rely on complete 
knowledge of the dependence for application. This is certainly true in our case in that we 
need to know how to choose $U_i$ and $L_i$ such that we almost surely control $E[Q_n(\mathcal{H})]$. For the
standard case of bounded loss, there are trivial bounds, but these will not give the necessary 
dependence on n which would imply learnability of good predictors. More knowledge of the 
dependence structure of the process is required, though this is in some sense undesirable. 
However, previous results in the dependent data setting, such as those presented in
[9], also have this requirement. They rely on precise knowledge of the mixing behavior of the
data which is unavailable. At the same time, mixing characterizations are often unintuitive 
conditions based on infinite dimensional joint distributions. Our version depends only on 
the ability to forecastably bound expectations given increasing amounts of data. 